Module 1.3: Using R
Old Dominion University
By now, you can probably use R like a calculator – adding and subtracting single numbers, etc.
Calling this thing a ‘phone’ is like calling a Lamborghini a… cupholder. An incredibly elaborate cupholder.
However, there are a lot of features that make R the Lamborghini of calculators.
In R, you can assign names to values (remember, objects). You do this by using either <- or =. Online, when Googling, you may find solutions with both. Despite what you might read, there are differences between the two, but we can ignore those differences for right now.
Why would you want to assign names to values? This allows your code to be much more flexible. Consider the following example.
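A minimal sketch of such an example is below. The particular calculations are illustrative assumptions; only the idea of naming 5 as x comes from the text.

```r
# Without a name, the value 5 is hard-coded on every line.
5 + 3
5 / 10

# With a name, the value is written once and reused everywhere.
x <- 5
x + 3            # 8
x / 10           # 0.5
as.character(x)  # "5"
```

To rerun everything with a different number, only the line `x <- 5` needs to change.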
Naming 5 as x allows us to change x only once, and the entire code will run. This will 1) reduce our effort 2) decrease typos / bugs and 3) increase readability. Now, x is not readable, per se, but this is just an example.
This may seem like a simple point, but it is very important. If you manipulate a variable in any way, but do not re-assign it to a name (same or different), it does not get updated/saved. Consider the following example.
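As a small sketch of this point (the values are chosen only for illustration):

```r
x <- 5
x + 1        # prints 6, but the result is not stored anywhere
x            # still 5

x <- x + 1   # re-assign the result to the same name to save it
x            # now 6
```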
It is important to choose informative names for your variables. Generally, single (or few) character names are easy to type, but can easily lose meaning. Too-long names aren’t great if you need to type them over and over. You will figure out a sweet spot for yourself.
There are some names you cannot use for your variable names, and other names that you simply shouldn’t. For example, you cannot start a variable name with a number. You cannot start names with certain punctuation either. On the other hand, you should not name things after already-used words that are native to R. This will just lead to confusing code. For example, do not name anything mean, because that is already a function name that is native to R.
Learning what is and what is not a good variable name takes time and practice.
So far, we have only worked with single values. Data tends to come in sets of multiple values, like large spreadsheets with columns and rows. Let’s build up to the R version of “spreadsheets”, which are called data.frames. We will touch on each of the following ways to store multiple values: vectors, matrices, lists, and data.frames.
In R, a vector is a collection of values that are all of the same type. We use c() to create vectors; the c stands for combine. Once we have our vector, we can do different things to it. For example, we know how to add two values, but what about a vector and a single value? Or two vectors?
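A sketch of these operations follows. The name vec1 comes from the text; vec2 and all of the values are assumptions made for illustration.

```r
vec1 <- c(1, 2, 3)
vec2 <- c(10, 20, 30)

vec1 + 10    # 11 12 13: 10 is added to every element
vec1 + vec2  # 11 22 33: first + first, second + second, ...

# Different lengths: the shorter vector is recycled
c(1, 2, 3, 4) + c(10, 20)  # 11 22 13 24
```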
Notice how when we added 10 to vec1, 10 was added to each element of vec1. However, when we added the two vectors, addition was element-wise. If two vectors are of different lengths, R will “recycle” the shorter one to match the longer one.
As a quick aside, R has a really good help functionality. To access this, you need to put a ? in front of whatever you want help with. For example, suppose you need help with the mean function from before.
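The help request for mean would look something like this:

```r
?mean
# help("mean") is an equivalent way to ask for the same page
```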
Running this line (as a reminder: ctrl + enter) will bring you to the function’s documentation.
The documentation shows the function’s usage as: mean(x, trim = 0, na.rm = FALSE, ...)
x, trim, and na.rm are the function’s arguments. These are inputs, and the function gives you an output.
x is the vector, x <- c(1, 4, 8, 7, 2), you want the mean of.
trim is the fraction of observations (elements in the vector) to be removed from each end before taking the mean. You might want to remove the top and bottom 5% of observations since they might be outliers.
na.rm is a boolean that, when TRUE, will remove NA values for you.
R also has different ways (functions) to generate vectors. Here are some shortcuts:
1:4 # This outputs every integer between 1 and 4
rep(1:4, times = 2) # Repeat 1-4 twice
# Function arguments are ordered, so it still works
# even without the "times ="
rep(1:4, 2)
seq(1, 4, by = .5) # Sequence from 1-4 by .5 increments
# 4 numbers drawn from a normal distribution
# with mean 0 and sd 1
rnorm(4, mean = 0, sd = 1)
[1] 1 2 3 4
[1] 1 2 3 4 1 2 3 4
[1] 1 2 3 4 1 2 3 4
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0
[1] 1.7049032 -0.7120386 -0.2779849 -0.1196490
What happens if you have a vector of elements that are of different types?
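A quick sketch to try (the values are made up): since every element of a vector must share one type, R converts the elements to a common type.

```r
mixed <- c(1, "a", TRUE)
mixed              # "1" "a" "TRUE": everything became text
class(mixed)       # "character"

class(c(1, TRUE))  # "numeric": TRUE is converted to 1
```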
Let’s suppose you only want a part of a vector. You can select elements from a vector by index (its position in the vector) or by boolean values. You do this by typing the vector’s name, followed by square brackets containing another vector of indices or boolean values. Here are some examples of how to do this:
this_vec <- c(0, 8, 3, 6, 1, 2, 2, 7, 6)
# Select the 3rd, 4th, and 5th observations
this_vec[c(3, 4, 5)]
# Select every other observation
# Note: there are 9 elements, but R cycles through c(TRUE, FALSE)
# until it gets through all 9.
this_vec[c(TRUE, FALSE)]
# Select observations less than 5.
this_vec[this_vec < 5]
# Select observations less than 5 or greater than 7
this_vec[this_vec < 5 | this_vec > 7]
# Select observations less than 5 and greater than 7
# Notice the output here since the logic is impossible
# Something cannot be less than 5 AND greater than 7
this_vec[this_vec < 5 & this_vec > 7]
[1] 3 6 1
[1] 0 3 1 2 6
[1] 0 3 1 2 2
[1] 0 8 3 1 2 2
numeric(0)
Another important way to subset vectors is with the %in% operator. Suppose you have a vector of years as follows: c(2006, 2006, 2003, 2005, 2012, 2002, 2016, 2006, 2008). If you were to subset the vector where you only kept elements where years were equal to 2006, 2007, or 2008, you would have to write the following:
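The long-hand version would look roughly like this (the name years is an assumption; the values are the ones listed above):

```r
years <- c(2006, 2006, 2003, 2005, 2012, 2002, 2016, 2006, 2008)

# One comparison per year, joined with "or"
years[years == 2006 | years == 2007 | years == 2008]
```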
[1] 2006 2006 2006 2008
This can be very tedious, is prone to error/typo, and infeasible if the list were much longer (i.e., not just three years). As a shortcut, R has the following:
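A sketch of the %in% shortcut, again assuming the vector is stored under the name years:

```r
years <- c(2006, 2006, 2003, 2005, 2012, 2002, 2016, 2006, 2008)

# Keep the elements of years that appear anywhere in the second vector
years[years %in% c(2006, 2007, 2008)]

# The same idea scales to long lists, e.g. a whole range of years
years[years %in% 2006:2008]
```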
A collection of vectors (of similar type and length) is called a matrix. Matrices have two dimensions: rows and columns. Matrices look like: example_mat[rows,cols]. You can combine vectors using rbind() (row-wise) or cbind() (column-wise). Let’s start by assuming you have a few vectors to work with.
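The following sketch reconstructs vectors consistent with the printed output that follows. The names v1, v2, and v3 appear in that output; the v1 values are copied from it as an assumption, since the original may have drawn them at random.

```r
v1 <- c(-0.1239606, 0.2681838, 0.7268415, 0.2331354)
v2 <- 1:4
v3 <- 9:6

cbind(v1, v2, v3)  # each vector becomes a column (4 rows, 3 columns)
rbind(v1, v2, v3)  # each vector becomes a row (3 rows, 4 columns)
```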
v1 v2 v3
[1,] -0.1239606 1 9
[2,] 0.2681838 2 8
[3,] 0.7268415 3 7
[4,] 0.2331354 4 6
[,1] [,2] [,3] [,4]
v1 -0.1239606 0.2681838 0.7268415 0.2331354
v2 1.0000000 2.0000000 3.0000000 4.0000000
v3 9.0000000 8.0000000 7.0000000 6.0000000
Another way to generate matrices would be to put one giant vector into the matrix function. Of course, you will need to give matrix() a bit of help. You need to tell it something about the dimensions you’d like. This could be ncol for number of columns or nrow for number of rows. In addition, you should specify whether the vector is “by row” or not (i.e. “by column”).
Below, r1 denotes an element belonging in the first row, etc.:
By row: c(r1, r1, r1, r2, r2, r2, r3, r3, r3)
By column: c(r1, r2, r3, r1, r2, r3, r1, r2, r3)
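A sketch of both layouts with matrix(). The name v_mat is the one used in the later code; building it from v1, v2, and v3 is an assumption that matches the printed output.

```r
v1 <- c(-0.1239606, 0.2681838, 0.7268415, 0.2331354)
v2 <- 1:4
v3 <- 9:6
v_mat <- c(v1, v2, v3)  # one long vector with 12 elements

# Filled column by column (the default): 4 rows, 3 columns
matrix(v_mat, ncol = 3)

# Filled row by row: 3 rows, 4 columns
matrix(v_mat, nrow = 3, byrow = TRUE)
```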
[,1] [,2] [,3]
[1,] -0.1239606 1 9
[2,] 0.2681838 2 8
[3,] 0.7268415 3 7
[4,] 0.2331354 4 6
[,1] [,2] [,3] [,4]
[1,] -0.1239606 0.2681838 0.7268415 0.2331354
[2,] 1.0000000 2.0000000 3.0000000 4.0000000
[3,] 9.0000000 8.0000000 7.0000000 6.0000000
Once you have the matrix of your dreams, you may need to access certain columns or rows. Remember: example_mat[rows, columns]. For vectors, if you want the first element, you would use example_vec[1]. For a matrix, example_mat[1,] will give you the first row, example_mat[,1] will give you the first column, and example_mat[i,j] will give you the element in the \(i^{th}\) row and \(j^{th}\) column. To select multiple rows, you can use logic or indices, much like vectors.
mat <- matrix(v_mat, ncol = 3)
mat; cat("\n") # whole matrix
mat[1,]; cat("\n") # first row
mat[,2]; cat("\n") # second column
mat[1,2]; cat("\n") # first row, second column
mat[c(1, 3),]; cat("\n") # first and third row
# Select all rows where the elements in the first column are positive.
# mat[,1] > 0 will return boolean values, and mat[,] will return the rows with TRUE values
mat[mat[,1] > 0, ]
[,1] [,2] [,3]
[1,] -0.1239606 1 9
[2,] 0.2681838 2 8
[3,] 0.7268415 3 7
[4,] 0.2331354 4 6
[1] -0.1239606 1.0000000 9.0000000
[1] 1 2 3 4
[1] 1
[,1] [,2] [,3]
[1,] -0.1239606 1 9
[2,] 0.7268415 3 7
[,1] [,2] [,3]
[1,] 0.2681838 2 8
[2,] 0.7268415 3 7
[3,] 0.2331354 4 6
Lists are just like vectors, except each element can be a different type. In fact, each element of a list can be an entire vector! Lists, for this reason, are incredibly flexible. Sometimes, this can even make it difficult to work with lists.
An interesting feature of lists is that you can name the elements within the list. This is possible with vectors as well, but not as useful. Here are some examples of naming and using the names within lists.
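A sketch consistent with the output shown next. The element name person1 and its values come from that output; the list name people is an assumption.

```r
people <- list(person1 = c("alex", "cardazzi"))

people                # prints the whole list, with the element's name
people$person1        # access an element by name with $
people[["person1"]]   # or with double square brackets
```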
$person1
[1] "alex" "cardazzi"
[1] "alex" "cardazzi"
[1] "alex" "cardazzi"
Changing the format of the list a little bit:
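One way the reorganized list might look is sketched below. The names people, first, and last are assumptions; the values are taken from the surrounding output and the table further down.

```r
people <- list(first = c("alex", "jalen", "thom"),
               last  = c("cardazzi", "brunson", "yorke"))

people$first       # all three first names
people$last[2:3]   # a named element is a vector, so it can be subset too
```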
[1] "alex" "jalen" "thom"
[1] "brunson" "yorke"
This is a special list because both vectors of the list have the same number of elements, or observations. Effectively, this is what a data.frame is. Really, it is what a spreadsheet is: a collection of columns all with the same number of rows.
So, what do data.frames look like?
Observations can be accessed in data.frames via the $ or [. These objects combine lists and matrices to make a more realistic view of the types of data that are most common in the real world.
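The data.frame printed below could be built roughly as follows. This is a reconstruction from the printed table; the name df is the one used later in the text.

```r
df <- data.frame(first = c("alex", "jalen", "thom"),
                 last = c("cardazzi", "brunson", "yorke"),
                 num_of_albums = c(0, 0, 10),
                 nba_seasons = c(0, 5, 0),
                 phds = c(1, 0, 0),
                 birth_country = c("us", "us", "uk"))

df                    # print the whole data.frame
df$last               # one column, accessed like a list element
df[2, "nba_seasons"]  # one cell, accessed like a matrix element
```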
first last num_of_albums nba_seasons phds birth_country
1 alex cardazzi 0 0 1 us
2 jalen brunson 0 5 0 us
3 thom yorke 10 0 0 uk
Suppose you want to subset the df object that you’ve created. Again, there are different ways to do this. Like matrices, to get some rows and all columns, you would use df[lim,] where lim is a vector of boolean values or indices. Leaving nothing following the comma indicates to R that you want everything in that dimension. To get columns, you can reverse this (df[,3:4]) or use names (df[,c("nba_seasons", "phds")]). If you only want a single column, of course, you can use df$phds.
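These subsetting patterns can be sketched as follows. The data.frame is rebuilt here from the printed table so the example runs on its own; the condition used for lim is an assumption chosen for illustration.

```r
df <- data.frame(first = c("alex", "jalen", "thom"),
                 last = c("cardazzi", "brunson", "yorke"),
                 num_of_albums = c(0, 0, 10),
                 nba_seasons = c(0, 5, 0),
                 phds = c(1, 0, 0),
                 birth_country = c("us", "us", "uk"))

lim <- df$birth_country == "us"  # boolean vector: TRUE TRUE FALSE
df[lim, ]                        # some rows, all columns
df[c(1, 3), ]                    # rows by index
df[, 3:4]                        # all rows, columns 3 and 4
df[, c("nba_seasons", "phds")]   # all rows, columns by name
df$phds                          # a single column as a vector
```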
As a final note about data.frames, here are a few important functions:
nrow(): Returns the number of rows in a data.frame.
ncol(): Returns the number of columns in a data.frame.
colnames(): Returns the names of columns in a data.frame.
To write a new script, click on the top left button underneath “File”. You should be able to see a white paper icon with a green +. This will open up a menu of different files. Just select “R Script” for now.
In the near future, we will learn to use R Markdown and/or Quarto Documents as well.
Now, you’re ready to write your first RScript… now what?
For beginners, as a template, the top of your script should look like the following:
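A sketch of such a template is below. The library() and setwd() lines are commented out because the package name and folder path are placeholders to be filled in.

```r
# library("")    # load packages here (package name left blank on purpose)
rm(list = ls())  # clear the environment
# setwd("")      # set the working directory (path left blank on purpose)
```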
library(""): This is where you load any packages you might want. We will discuss packages later.
rm(list = ls()): This is how you clear your environment. It is a good idea to start with an empty environment so you don’t get confused between what is old and what is new.
setwd(""): This is where you set your working directory (e.g., the folder containing Documents/Econ 311/HW/HW01.html).
R has two helpful functions for working directories. First, getwd() tells you where R is currently looking.
Then, if I wanted to change this, I would use setwd(). Here are a few ways to move around using relative paths:
# Two periods means "go back one level"
# So, if we were in "Documents/Econ 311/HW01"
# ".." would bring us to "Documents/Econ 311"
setwd("..")
# If there was another folder inside "HW01",
# for example, suppose you have a folder "Data" inside "HW01"
# From "HW01", you can navigate to "Data" like:
setwd("Data")
# Maybe you want to go from "HW01" to "HW02/Data"
# You would have to back out from HW01 (using "..")
# Then go into HW02 and Data
setwd("../HW02/Data")
Of course, to use the ".." trick, you need to know where you’re starting from (i.e. getwd()). If you are unsure, you could always type in your entire working directory in one go.
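To experiment safely, here is a self-contained sketch that builds a throwaway folder structure and moves around in it. The folder names mirror the comments above but are hypothetical, and everything is created inside R's temporary directory.

```r
old_wd <- getwd()                     # remember where we started

# Hypothetical folders, created inside R's temporary directory
root <- file.path(tempdir(), "Econ 311")
dir.create(file.path(root, "HW01", "Data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "HW02", "Data"), recursive = TRUE, showWarnings = FALSE)

setwd(file.path(root, "HW01"))        # full path: works from anywhere
step1 <- basename(getwd())            # "HW01"

setwd("Data")                         # relative: into HW01/Data
setwd("../../HW02/Data")              # back out twice, then into HW02/Data
step2 <- basename(dirname(getwd()))   # "HW02"

setwd(old_wd)                         # return to the starting folder
```

Saving getwd() first and restoring it at the end keeps the rest of a script from being affected by the detour.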
Now that R knows where to look, we need it to import our data so we can use it. Most of the time in this course, we will use files that end in .csv. This stands for “comma separated values”. .csv files are very common and require relatively small amounts of storage. .csv files are also open-able in Excel (you just might get some warning about how any Excel formulas you write will not be saved). To create a .csv from an .xlsx (Excel) file, just use “Save As” in Excel, and change the file extension to “Comma Separated Values (.csv)”.
Now, let’s read in a file called “ford_escort.csv”. On my machine, it is in the folder “data”. Rather than changing my working directory, I can just put the path of the file. We will use the read.csv() function to import our data.
# My working directory is "econ311/module01",
# but the file is in "econ311/data", so
# I need to back out of "module01" and navigate to the data folder
ford <- read.csv("../data/ford_escort.csv")
dim(ford); cat("\n") # dim() gives the number of rows and columns
head(ford) # the head() function displays the first 6 rows.
# I could have also done:
# setwd("../data")
# ford <- read.csv("ford_escort.csv")
[1] 23 3
Year Mileage..thousands. Price
1 1998 27 9991
2 1997 17 9925
3 1998 28 10491
4 1998 5 10990
5 1997 38 9493
6 1997 36 9991
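Now that we have our data read into R, we can manipulate it. The code below renames Mileage..thousands. to mileage, counts the cars priced under $9,000, converts mileage from thousands of miles to miles, computes the cost per mile ($/mi) and its average, and uses the range() function to find the minimum and maximum $/mi.
As a note, you can use cat() to combine text and code. Put "\n" at the end to make a new line. You can experiment with this on your own.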
# colnames(ford) # Take a look at the column names.
colnames(ford)[2] <- "mileage" # change the second name
# ford$Price < 9000 # this gives boolean values.
# Since R treats TRUE as 1 and FALSE as 0, use sum()
cat("Number of Escorts less than $9,000:", sum(ford$Price < 9000), "\n")
ford$mileage <- ford$mileage * 1000 # Multiply by 1000 and save
ford$cost_per_mile <- ford$Price / ford$mileage # Create $/mi
cat("Average Cost per Mile:", mean(ford$cost_per_mile), "\n") # average
cat("Range of Cost per Mile:", range(ford$cost_per_mile)) # min and max
Number of Escorts less than $9,000: 5
Average Cost per Mile: 0.4315369
Range of Cost per Mile: 0.08325 2.198
An especially attractive feature of R is its powerful graphics. Just Google “Best R Plots”, or something, and you’ll see what I mean.
To start, we’ll learn some of the basics. We will begin by generating two scatter plots using data from ford.
mileage vs Price.
mileage vs cost_per_mile.
Below are some examples of additional arguments for the plot() function.
las = 1 rotates the text on the y-axis. Different numbers will rotate it more or less.
col sets the colors used in the plot. This can take a vector of colors.
pch sets the type of point used.
cex sets the size of the points. The default is 1.
main sets the title of the plot.
xlab sets the name of the x-axis.
ylab sets the name of the y-axis.
We can also add reference lines to the plot, and also make the colors a bit more complex.
# Set all colors as "tomato"
ford$point_color <- "tomato"
# If the Year is less than the mean year, color it "dodgerblue"
# Of course, these are therefore the "older" cars
ford$point_color[ford$Year < mean(ford$Year)] <- "dodgerblue"
plot(ford$mileage, ford$cost_per_mile, las = 1,
pch = 19, cex = 1.2,
col = ford$point_color, main = "Cost vs Mileage",
xlab = "Mileage", ylab = "Price per Mile")
abline(h = 1) # horiz. line at Y = 1
abline(v = mean(ford$mileage)) # vert. line at the mean of X
Of course, whenever you choose to add some differences in shapes, colors, etc., it’s helpful to add a legend to your plot. To do this, we can use the legend() function. This function accepts a few important arguments:
bty: setting this to "n" removes the box around the legend. I always use this option.
legend: this is the actual text to be displayed in the legend. It accepts a character vector, so if you colored your plot by men and women, you would use c("Men", "Women").
x, y: You can specify the exact coordinates of your legend, or you can specify things like: "topleft", "topright", "bottomleft", or "bottomright".
horiz: this accepts a boolean value, and turns the legend from vertical to horizontal.
pch, lty: use the pch or lty options to tell R if you want to display points or lines next to your legend.
Below is a plot with two legends (which is certainly redundant) to show off some of the different ways to customize the output.
plot(ford$mileage, ford$cost_per_mile, las = 1,
pch = 19, cex = 1.2,
col = ford$point_color, main = "Cost vs Mileage",
xlab = "Mileage", ylab = "Price per Mile")
legend("topright", pch = 19, bty = "n", horiz = TRUE,
legend = c("Old Ford", "New Ford"), cex = 1.5,
col = c("dodgerblue", "tomato"))
legend("bottomleft", lty = c(1, 2), pch = c(2, 19),
legend = c("Old Ford", "New Ford"),
col = c("dodgerblue", "tomato"))
When generating figures, you will sometimes need to add data from a different source to the same set of axes. As an example, let’s simply plot the data above, but in two steps instead of one.
To do this, we will use points(). This function accepts nearly every argument plot() does, except you are unable to impact the axes/labels of the plot.
plot(ford$mileage[ford$point_color == "tomato"],
ford$cost_per_mile[ford$point_color == "tomato"],
las = 1, pch = 19, cex = 1.2,
col = "tomato", main = "Cost vs Mileage",
xlab = "Mileage", ylab = "Price per Mile")
points(ford$mileage[ford$point_color != "tomato"],
ford$cost_per_mile[ford$point_color != "tomato"],
pch = 19, cex = 1.2, col = "dodgerblue")
Once you get the hang of using plot() and points() in tandem, you’ll find it convenient that points() does not impact the axes. However, to start, this will be annoying. For example, let’s switch the order of the data in plot() and points().
plot(ford$mileage[ford$point_color != "tomato"],
ford$cost_per_mile[ford$point_color != "tomato"],
las = 1, pch = 19, cex = 1.2,
col = "dodgerblue", main = "Cost vs Mileage",
xlab = "Mileage", ylab = "Price per Mile")
points(ford$mileage[ford$point_color == "tomato"],
ford$cost_per_mile[ford$point_color == "tomato"],
pch = 19, cex = 1.2, col = "tomato")
The plot is different because when plot() is setting the axes, it doesn’t know that you’re planning on using points() next. So, it scales the axes so the data fed into plot() “fits” the space.
To overcome this, we can use the following trick:
plot(0, 0, type = "n",
ylim = range(ford$cost_per_mile),
xlim = range(ford$mileage), # range can include multiple vectors
main = "Cost vs Mileage", las = 1,
xlab = "Mileage", ylab = "Price per Mile")
points(ford$mileage[ford$point_color != "tomato"],
ford$cost_per_mile[ford$point_color != "tomato"],
pch = 19, cex = 1.2, col = "dodgerblue")
points(ford$mileage[ford$point_color == "tomato"],
ford$cost_per_mile[ford$point_color == "tomato"],
pch = 19, cex = 1.2, col = "tomato")
Of course, this is a lot more coding than the initial plot’s code. The idea of showing you this is that, now, you can always make sure your data “fits”. This is one of the little things that I use constantly, but it took me a long time to figure out.
Finally, adding lines to a plot is very similar in that one needs to use lines(). To illustrate, we will examine panel data on cigarette consumption by state (documentation).
I am going to read in the data and plot sales by year.
This figure is very difficult to understand. Let’s trim it down to just a few states. In addition, we can add colors to the figure.
This plot can still be improved. It’d be a lot more natural to see the data as lines instead of points. To do this, we can use type = "l".
Notice two things about this plot. First, there’s only a single color. In R, you should think of a line as a single point. R cannot color different parts of a line differently, so it will just take the first color (here, it’s 1, which is black). Second, there are these insane diagonal lines across the plot. This is because R wants to connect everything into a single line, in the order the rows appear in the data.
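The diagonal lines can be reproduced on a tiny made-up panel. The data below are hypothetical and only borrow the column names state, year, and sales from the cigarette data.

```r
# Hypothetical toy panel: two states, three years each (made-up numbers)
toy <- data.frame(state = c(1, 1, 1, 2, 2, 2),
                  year  = c(2000, 2001, 2002, 2000, 2001, 2002),
                  sales = c(100, 95, 90, 120, 118, 115))

# R connects the points in row order. The negative jump in year is
# where the line doubles back from state 1's last year to state 2's first.
diff(toy$year)  # 1 1 -2 1 1

# One call with type = "l" draws a single line through all six points
plot(toy$year, toy$sales, type = "l", las = 1,
     xlab = "Year", ylab = "Sales")
```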
To fix this, we need to use lines like we used points before. This is an example case of a time where we’ll want to set up the axes before we plot anything.
# before, I plotted 0, 0
# now, I am simply keeping the data
# in plot().
# this way, I don't need to set the axes
# via the ylim and xlim arguments
plot(cig$year, cig$sales,
las = 1, type = "n",
ylab = "Sales", xlab = "Year")
lines(cig$year[cig$state == 1],
cig$sales[cig$state == 1],
col = 1)
lines(cig$year[cig$state == 3],
cig$sales[cig$state == 3],
col = 3)
lines(cig$year[cig$state == 4],
cig$sales[cig$state == 4],
col = 4)
lines(cig$year[cig$state == 5],
cig$sales[cig$state == 5],
col = 5)
legend("bottomleft", ncol = 2,
legend = c("State 1", "State 3", "State 4", "State 5"),
bty = "n", col = c(1, 3, 4, 5), lty = 1)
Next module, we’ll learn about “loops”, which will cut down on the amount of code we need to write to generate these lines.
ECON 311: Economics, Causality, and Analytics